Controlling an apparatus based on speech

ABSTRACT

An apparatus with a speech control unit includes a microphone array having multiple microphones for receiving respective audio signals, and a beam forming module for extracting a speech signal of a user from the audio signals. A keyword recognition system recognizes a predetermined keyword that is spoken by the user and which is represented by a particular audio signal, and is arranged to control the beam forming module on basis of the recognition. A speech recognition unit creates an instruction for the apparatus based on recognized speech items of the speech signal. As a consequence, the speech control unit is more selective for those parts of the audio signals for speech recognition which correspond to speech items spoken by the user.

The invention relates to a speech control unit for controlling an apparatus on basis of speech, comprising:

-   a microphone array, comprising multiple microphones for receiving respective audio signals;
-   a beam forming module for extracting a speech signal of a user, from the audio signals as received by the microphones, by means of enhancing first components of the audio signals which represent an utterance originating from a first orientation of the user relative to the microphone array; and
-   a speech recognition unit for creating an instruction for the apparatus based on recognized speech items of the speech signal.

The invention further relates to an apparatus comprising:

-   such a speech control unit for controlling the apparatus on basis of speech; and
-   processing means for execution of the instruction being created by the speech control unit.

The invention further relates to a method of controlling an apparatus on basis of speech, comprising:

-   receiving respective audio signals by means of a microphone array, comprising multiple microphones;
-   extracting a speech signal of a user, from the audio signals as received by the microphones, by means of enhancing first components of the audio signals which represent an utterance originating from a first orientation of the user relative to the microphone array; and
-   creating an instruction for the apparatus based on recognized speech items of the speech signal.

Natural spoken language is a preferred means for human-to-human communication. Because of recent advances in automatic speech recognition, natural spoken language is emerging as an effective means for human-to-machine communication. The user is being liberated from manipulating a keyboard and mouse, which requires great hand/eye coordination. This hands-free advantage of human-to-machine communication through speech recognition is particularly desired in situations where the user must be free to use his/her eyes and hands, and to move about unencumbered while talking. However, the user is still encumbered in present systems by hand-held, body-worn, or tethered microphone equipment, e.g. a headset microphone, which captures audio signals and provides input to the speech recognition unit. This is because most speech recognition units work best with a close-talking microphone input, i.e. with the user and microphone in close proximity. When they are deployed in “real-world” environments, the performance of known speech recognition units typically degrades. The degradation is particularly severe when the user is far from the microphone. Room reverberation and interfering noise contribute to the degraded performance.

In general it is uncomfortable to wear a headset microphone on the head for any extended period of time, while a hand microphone can limit the freedom of the user as it occupies the user's hands, and there has been a demand for a speech input scheme that allows more freedom to the user. A microphone array in combination with a beam forming module appears to be a good approach that can resolve the conventionally encountered inconvenience described above. The microphone array is a set of microphones which are arranged at different positions. The multiple audio signals received by the respective microphones of the array are provided to the beam forming module. The beam forming module has to be calibrated, i.e. an orientation or position of a particular sound source relative to the microphone array has to be estimated. The particular sound source might be the source in the environment of the microphone array which generates sound having parameters corresponding to predetermined parameters, e.g. comprising predetermined frequencies matching with human voice. However, often the calibration is based on the loudest sound, i.e. the particular sound source generates the loudest sound. For example, a beam forming module can be calibrated on basis of the user who is speaking loudly, compared to other users in the same environment. A sound source direction or position can be estimated from time differences among signals from different microphones, using a delay sum array method or a method based on the cross-correlation function as disclosed in: “Knowing Who to Listen to in Speech Recognition: Visually Guided Beamforming”, by U. Bub, et al. ICASSP'95, pp. 848-851, 1995. A parametric method estimating the sound source position (or direction) is disclosed in S. V. Pillai: “Array Signal Processing”, Springer-Verlag, New York, 1989.
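
The estimation of a direction from time differences can be illustrated with a minimal sketch. It assumes only two microphones, a far-field source and integer sample delays; the function names, the sample rate, the microphone spacing and the speed of sound of 343 m/s are illustrative assumptions and are not taken from the cited references.

```python
import math


def cross_correlation_lag(a, b, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation of a and b.

    A positive result means that signal b arrives later than signal a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(a)):
            m = n + lag
            if 0 <= m < len(b):
                score += a[n] * b[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle(lag_samples, sample_rate_hz, mic_spacing_m, speed_of_sound=343.0):
    """Convert a delay between two microphones into an angle in degrees.

    0 degrees corresponds to a source straight in front of the microphone pair.
    """
    delay_s = lag_samples / sample_rate_hz
    # Clamp to the physically possible range before taking the arcsine.
    ratio = max(-1.0, min(1.0, delay_s * speed_of_sound / mic_spacing_m))
    return math.degrees(math.asin(ratio))


# The same pulse reaches the second microphone three samples later.
mic_a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
lag = cross_correlation_lag(mic_a, mic_b, max_lag=5)
angle = arrival_angle(lag, sample_rate_hz=16000, mic_spacing_m=0.1)
```

With more than two microphones, the pairwise delays would typically be combined into a single estimate; that step is omitted here.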

After being calibrated, i.e. the current orientation being estimated, the beam forming module is arranged to enhance sound originating from a direction corresponding to the current direction and to reduce noise, by synthetic processing of outputs of these microphones. It is assumed that the output of the beam forming module is a clean signal that is appropriate to be provided to a speech recognition unit, resulting in robust speech recognition. This means that the components of the audio signals are processed such that the speech items of the user can be extracted.
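
The text does not specify how the microphone outputs are combined. One common technique, shown here purely as an illustrative assumption, is delay-and-sum beam forming: each signal is shifted by the delay expected for the calibrated direction and the shifted signals are averaged, so that sound from that direction adds coherently while sound from other directions is smeared. The function name and the integer sample delays are hypothetical.

```python
def delay_and_sum(signals, delays):
    """Align each microphone signal by its integer delay (in samples) and average.

    signals: list of equally long sample lists, one per microphone.
    delays:  per-microphone arrival delay, in samples, for the tuned direction.
    """
    length = len(signals[0])
    output = [0.0] * length
    for signal, delay in zip(signals, delays):
        for n in range(length):
            m = n + delay  # read the sample that corresponds to time n at the source
            if 0 <= m < length:
                output[n] += signal[m]
    return [value / len(signals) for value in output]


source = [0, 1, 2, 1, 0, 0, 0]
mic_0 = source                     # reference microphone, no delay
mic_1 = [0, 0, 0, 1, 2, 1, 0]      # same sound, two samples later

tuned = delay_and_sum([mic_0, mic_1], delays=[0, 2])     # calibrated correctly
mistuned = delay_and_sum([mic_0, mic_1], delays=[0, 0])  # calibrated for another direction
```

In this toy case the correctly tuned output reproduces the source, whereas the mistuned output has a lower and wider peak.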

An embodiment of a system comprising a microphone array, a beam forming module and a speech recognition unit is known from European Patent Application EP 0795851 A2. The Application discloses that a sound source position or direction estimation and a speech recognition can be achieved with the system. The disadvantage of this system is that it does not work appropriately in a multi-user situation. Suppose that the system has been calibrated for a first position of the user. Then the user starts moving. The system should be re-calibrated first to be able to recognize speech correctly. The system requires audio signals as input for the calibration, i.e. the user has to speak something. However, if another user starts speaking in between, then the re-calibration will not provide the right result: the system will get tuned to the other user.

It is an object of the invention to provide a speech control unit of the kind described in the opening paragraph which is arranged to recognize speech of a user who is moving in an environment in which other users might speak too.

This object of the invention is achieved in that the speech control unit comprises a keyword recognition system for recognition of a predetermined keyword that is spoken by the user and which is represented by a particular audio signal, and in that the speech control unit is arranged to control the beam forming module, on basis of the recognition of the predetermined keyword, in order to enhance second components of the audio signals which represent a subsequent utterance originating from a second orientation of the user relative to the microphone array. The keyword recognition system is arranged to discriminate between audio signals related to utterances representing the predetermined keyword and audio signals related to other utterances which do not represent the predetermined keyword. The speech control unit is arranged to re-calibrate if it receives sound corresponding to the predetermined keyword from a different orientation. Preferably this sound has been generated by the user who initiated an attention span (see also FIG. 3) of the apparatus to be controlled. There will be no re-calibration if the predetermined keyword has not been recognized. As a consequence, speech items which are spoken from another orientation and which are not preceded by the predetermined keyword will be discarded.

In an embodiment of the speech control unit according to the invention, the keyword recognition system is arranged to recognize the predetermined keyword that is spoken by another user and the speech control unit is arranged to control the beam forming module, on basis of this recognition, in order to enhance third components of the audio signals which represent another utterance originating from a third orientation of the other user relative to the microphone array. This embodiment of the speech control unit is arranged to re-calibrate on basis of the recognition of the predetermined keyword spoken by another user. Besides following one particular user, this embodiment is arranged to calibrate on basis of sound from multiple users. That means that only authorized users, i.e. those who have authorization to control the apparatus because they have spoken the predetermined keyword, are recognized as such and hence only speech items from them will be accepted for the creation of instructions for the apparatus.

In an embodiment of the speech control unit according to the invention, a first one of the microphones of the microphone array is arranged to provide the particular audio signal to the keyword recognition system. In other words, the particular audio signal which is used for keyword recognition corresponds to one of the audio signals as received by the microphones of the microphone array. The advantage is that no additional microphone is required.

In an embodiment of the speech control unit according to the invention, the beam forming module is arranged to determine a first position of the user relative to the microphone array. Besides the orientation, a distance between the user and the microphone array is also determined. The position is calculated on basis of the orientation and distance. An advantage of this embodiment according to the invention is that the speech control unit is arranged to discriminate between sounds originating from users who are located in front of each other.
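
The calculation of a position from an orientation and a distance can be sketched as a conversion from polar to Cartesian coordinates in the plane of the microphone array. The function names, the two-dimensional simplification and the 0.5 m matching radius are illustrative assumptions.

```python
import math


def position(orientation_deg, distance_m):
    """Return (x, y) in metres for a source at the given orientation and distance."""
    angle = math.radians(orientation_deg)
    return (distance_m * math.cos(angle), distance_m * math.sin(angle))


def same_source(p, q, radius_m=0.5):
    """Treat two positions as the same source if they lie within radius_m of each other."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= radius_m


# Two users standing in front of each other: same orientation, different distance.
near_user = position(30.0, 1.0)
far_user = position(30.0, 3.0)
distinct = not same_source(near_user, far_user)
```

Orientation alone cannot separate these two users, since both are at 30 degrees; the added distance makes their positions distinguishable.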

It is a further object of the invention to provide an apparatus of the kind described in the opening paragraph which is arranged to be controlled by a user who is moving in an environment in which other users might speak too.

This object of the invention is achieved in that the apparatus comprises the speech control unit as claimed in claim 1.

An embodiment of the apparatus according to the invention is arranged to show that the predetermined keyword has been recognized. An advantage of this embodiment according to the invention is that the user gets informed about the recognition.

An embodiment of the apparatus according to the invention which is arranged to show that the predetermined keyword has been recognized comprises audio generating means for generating an audio signal. By generating an audio signal, e.g. “Hello”, it is made clear to the user that the apparatus is ready to receive speech items from the user. This concept is also known as auditory greeting.

It is a further object of the invention to provide a method of the kind described in the opening paragraph which enables recognition of speech of a user who is moving in an environment in which other users might speak too.

This object of the invention is achieved in that the method is characterized in comprising recognition of a predetermined keyword that is spoken by the user based on a particular audio signal and controlling the extraction of the speech signal of the user, on basis of the recognition, in order to enhance second components of the audio signals which represent a subsequent utterance originating from a second orientation of the user relative to the microphone array.

Modifications of the speech control unit and variations thereof may correspond to modifications and variations of the apparatus described and of the method described.

These and other aspects of the speech control unit, of the method and of the apparatus according to the invention will become apparent from and will be elucidated with respect to the implementations and embodiments described hereinafter and with reference to the accompanying drawings, wherein:

FIG. 1 schematically shows an embodiment of the speech control unit according to the invention;

FIG. 2 schematically shows an embodiment of the apparatus according to the invention; and

FIG. 3 schematically shows the creation of an instruction on basis of a number of audio signals.

The same reference numerals are used to denote similar parts throughout the Figures.

FIG. 1 schematically shows an embodiment of the speech control unit 100 according to the invention. The speech control unit 100 is arranged to provide instructions to the processing unit 202 of the apparatus 200. These instructions are provided at the output connector 122 of the speech control unit 100, which comprises:

-   a microphone array, comprising multiple microphones 102, 104, 106, 108 and 110 for receiving respective audio signals 103, 105, 107, 109 and 111;
-   a beam forming module 116 for extracting a clean, i.e. speech, signal 117 of a user U1, from the audio signals 103, 105, 107, 109 and 111 as received by the microphones 102, 104, 106, 108 and 110;
-   a keyword recognition system 120 for recognition of a predetermined keyword that is spoken by the user and which is represented by a particular audio signal 111 and being arranged to control the beam forming module, on basis of the recognition; and
-   a speech recognition unit 118 for creating an instruction for the apparatus 200 based on recognized speech items of the speech signal 117.

The working of the speech control unit 100 is as follows. It is assumed that initially the speech control unit 100 is calibrated on basis of utterances of user U1 being at position P1. The result is that the beam forming module 116 of the speech control unit 100 is “tuned” to sound originating from directions which substantially match direction α. Sound from directions which differ from direction α by more than a predetermined threshold is disregarded for speech recognition. E.g. speech of user U2, being located at position P2 with a direction φ relative to the microphone array, is neglected. Preferably, the speech control unit 100 is sensitive to sound with voice characteristics, i.e. speech, and is insensitive to other sounds. For instance, the sound of the music as generated by the speaker S1, which is located in the vicinity of user U1, is filtered out by the beam forming module 116.
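
The threshold test on directions described above can be sketched as follows. The function names, the use of degrees and the 15 degree default threshold are illustrative assumptions; the text only states that a predetermined threshold exists.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two directions, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return 360.0 - diff if diff > 180.0 else diff


def accept_for_recognition(sound_deg, tuned_deg, threshold_deg=15.0):
    """Accept sound only if its direction substantially matches the tuned direction."""
    return angular_difference(sound_deg, tuned_deg) <= threshold_deg


alpha = 30.0   # hypothetical value for the direction of user U1
phi = 100.0    # hypothetical value for the direction of user U2

from_u1 = accept_for_recognition(32.0, alpha)   # close to alpha: accepted
from_u2 = accept_for_recognition(phi, alpha)    # far from alpha: disregarded
```

The wrap-around handling matters only for arrays that cover a full circle, where for example 350 degrees and 10 degrees are 20 degrees apart.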

Suppose that user U1 has moved to position P2, corresponding to an orientation β relative to the microphone array. Without re-calibration of the speech control unit 100, or more particularly the beam forming module 116, the recognition of speech items would probably fail. However, the speech control unit 100 will get calibrated again when user U1 starts speaking with the predetermined keyword. The predetermined keyword as spoken by user U1 is recognized and used for the re-calibration. Optionally, further words spoken by the first user U1 which succeed the keyword are also applied for the re-calibration. If another user, e.g. user U2, starts speaking without first speaking the predetermined keyword, then his/her utterances are recognized as not relevant and skipped for the re-calibration. As a consequence the speech control unit 100 is arranged to stay “tuned” to user U1 while he/she is moving. Speech signals of this user U1 are extracted from the audio signals 103, 105, 107, 109 and 111 and are the basis for speech recognition. Other sounds are not taken into account for the control of the apparatus.
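
The keyword-gated re-calibration described above can be summarized in a small sketch. It assumes that each utterance arrives already split into words and tagged with an estimated orientation; the class name, the method name and the 15 degree tolerance are hypothetical, and real keyword spotting on audio is not modelled.

```python
class KeywordGatedTracker:
    """Stay tuned to one orientation; re-tune only when the keyword is heard."""

    def __init__(self, keyword, tolerance_deg=15.0):
        self.keyword = keyword.lower()
        self.tolerance_deg = tolerance_deg
        self.orientation = None  # None means: not calibrated yet

    def hear(self, orientation_deg, words):
        """Return the speech items accepted for recognition, or [] if discarded."""
        if words and words[0].lower() == self.keyword:
            # Keyword recognized: re-calibrate to where it came from.
            self.orientation = orientation_deg
            return words[1:]
        if (self.orientation is not None
                and abs(orientation_deg - self.orientation) <= self.tolerance_deg):
            # Same orientation as the tuned one: keep listening.
            return words
        # Different orientation and no keyword: discard, do not re-calibrate.
        return []


tracker = KeywordGatedTracker("Bello")
first = tracker.hear(30.0, ["Bello", "channel", "next"])   # user at a first orientation
ignored = tracker.hear(100.0, ["turn", "it", "off"])       # someone else, no keyword
moved = tracker.hear(75.0, ["Bello", "mute"])              # keyword from a new orientation
```

After the last call the tracker is tuned to 75 degrees, so the speaker is followed only because the utterance started with the keyword.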

Above it is explained that the speech control unit 100 is arranged to “follow” one specific user U1. This user might be the user who initiated the attention span of the speech control unit. Optionally, the speech control unit 100 is arranged to get subsequently tuned to a number of users.

FIG. 1 depicts that the microphone 110 is connected to both the keyword recognition system 120 and the beam forming module 116. This is optional, which means that an additional microphone could have been used. The keyword recognition system 120 might be comprised by the speech recognition unit 118. The components 116-120 of the speech control unit 100 and the processing unit 202 of the apparatus 200 may be implemented using one processor. Normally, both functions are performed under control of a software program product. During execution, normally the software program product is loaded into a memory, like a RAM, and executed from there. The program may be loaded from a background memory, like a ROM, hard disk, or magnetic and/or optical storage, or may be loaded via a network like the Internet. Optionally an application specific integrated circuit provides the disclosed functionality.

FIG. 2 schematically shows an embodiment of the apparatus 200 according to the invention. The apparatus 200 optionally comprises a generating means 206 for generating an audio signal. By generating an audio signal, e.g. “Hello”, it is made clear to the user that the apparatus is ready to receive speech items from the user. Optionally the generating means 206 is arranged to generate multiple sounds: e.g. a first sound to indicate that the apparatus is in a state of calibrating and a second sound to indicate that the apparatus is in a state of being calibrated and hence the apparatus is in an active state of recognizing speech items. The generating means 206 comprises a memory device for storage of sampled audio signals, a sound generator and a speaker. Optionally, the apparatus also comprises a display device 204 for displaying a visual representation of the state of the apparatus.

The speech control unit 100 according to the invention is preferably used in a multi-function consumer electronics system, like a TV, set top box, VCR, or DVD player, game box, or similar device. But it may also be a consumer electronic product for domestic use such as a washing or kitchen machine, any kind of office equipment like a copying machine, a printer, various forms of computer work stations etc., electronic products for use in the medical sector or any other kind of professional use, as well as a more complex electronic information system. Besides that, it may be a product specially designed to be used in vehicles or other means of transport, e.g. a car navigation system. Whereas the term “multifunction electronic system” as used in the context of the invention may comprise a multiplicity of electronic products for domestic or professional use as well as more complex information systems, the number of individual functions to be controlled by the method would normally be limited to a reasonable level, typically in the range from 2 to 100 different functions. For a typical consumer electronic product like a TV or audio system, only a more limited number of functions needs to be controlled, e.g. 5 to 20 functions. Examples of such functions may include volume control including muting, tone control, channel selection and switching from inactive or stand-by condition to active condition and vice versa, which could be initiated by control commands such as “louder”, “softer”, “mute”, “bass”, “treble”, “change channel”, “on”, “off”, “stand-by” etcetera.

In the description it is assumed that the speech control unit 100 is located in the apparatus 200 being controlled. It will be appreciated that this is not required and that the control method according to the invention is also possible where several devices or apparatus are connected via a network (local or wide area), and the speech control unit 100 is located in a different device than the device or apparatus being controlled.

FIG. 3 schematically shows the creation of an instruction 318 on basis of a number of audio signals 103, 105, 107, 109 and 111 as received by the microphones 102, 104, 106, 108 and 110. From the audio signals the speech items 304-308 are extracted. The speech items 304-308 are recognized and voice commands 312-316 are assigned to these speech items 304-308. The voice commands 312-316 are “Bello”, “Channel” and “Next”, respectively. An instruction “Increase_Frequency_Band”, which is interpretable for the processing unit 202, is created based on these voice commands 312-316.

To avoid that conversations or utterances not intended for controlling the apparatus are recognized and executed, the speech control unit 100 optionally requires the user to activate the speech control unit 100, resulting in a time span, also called an attention span, during which the speech control unit 100 is active. Such an activation may be performed via voice, for instance by the user speaking a keyword, like “TV” or “Device-Wake-up”. Preferably the keyword for initiating the attention span is the same as the predetermined keyword for re-calibrating the speech control unit.
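
A minimal sketch of such an attention span follows. The class name, the explicit timestamps passed in by the caller and the 10 second default duration are illustrative assumptions; the text does not state how long an attention span lasts.

```python
class AttentionSpan:
    """Track whether the speech control unit is active after the keyword was spoken."""

    def __init__(self, keyword, duration_s=10.0):
        self.keyword = keyword.lower()
        self.duration_s = duration_s
        self.started_at = None  # no attention span has been started yet

    def on_word(self, word, now_s):
        """Start (or restart) the attention span when the keyword is heard."""
        if word.lower() == self.keyword:
            self.started_at = now_s

    def is_active(self, now_s):
        """Return True while the current attention span has not yet expired."""
        if self.started_at is None:
            return False
        return 0.0 <= now_s - self.started_at < self.duration_s


span = AttentionSpan("Bello", duration_s=10.0)
before = span.is_active(0.0)     # nobody has spoken the keyword yet
span.on_word("Bello", now_s=2.0)
during = span.is_active(5.0)     # 3 seconds into the span
after = span.is_active(12.0)     # exactly 10 seconds later: expired
```

Passing the time in explicitly keeps the sketch deterministic; a real unit would read a clock instead.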

By using an anthropomorphic character a barrier for interaction is removed: it is more natural to address the character instead of the product, e.g. by saying “Bello” to a dog-like character. Moreover, a product can make effective use of one object with several appearances, chosen as a result of several state elements. For instance, a basic appearance like a sleeping animal can be used to show that the speech control unit 100 is not yet active. A second group of appearances can be used when the speech control unit 100 is active, e.g. awake appearances of the animal. The progress of the attention span can then, for instance, be expressed by the angle of the ears: fully raised at the beginning of the attention span, fully down at the end. Similar appearances can also express whether or not an utterance was understood: an “understanding look” versus a “puzzled look”. Audible feedback can also be combined, like a “glad” bark if a speech item has been recognized. A user can quickly grasp the feedback on all such system elements by looking at the one appearance which represents all these elements, e.g. raised ears and an “understanding look”, or lowered ears and a “puzzled look”. The position of the eyes of the character can be used to feed back to the user where the system is expecting the user to be.
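
The mapping from the progress of the attention span to the angle of the ears can be sketched as a linear interpolation. The linear shape, the function name and the angles of 90 degrees (fully raised) and 0 degrees (fully down) are illustrative assumptions; the text only fixes the two end points.

```python
def ear_angle(elapsed_s, span_s, raised_deg=90.0, down_deg=0.0):
    """Return the ear angle for the elapsed part of an attention span.

    The ears are fully raised at the start and fully down at the end.
    """
    progress = min(1.0, max(0.0, elapsed_s / span_s))  # clamp to the range 0..1
    return raised_deg + (down_deg - raised_deg) * progress


start = ear_angle(0.0, 10.0)     # beginning of the attention span
halfway = ear_angle(5.0, 10.0)   # half of the attention span has passed
end = ear_angle(10.0, 10.0)      # end of the attention span
```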

Once a user has started an attention span, the apparatus, i.e. the speech control unit 100, is in a state of accepting further speech items. These speech items 304-308 will be recognized and associated with voice commands 312-316. A number of voice commands 312-316 together will be combined into one instruction 318 for the apparatus. E.g. a first speech item is associated with “Bello”, resulting in a wake-up of the television. A second speech item is associated with the word “channel” and a third speech item is associated with the word “next”. The result is that the television will switch, i.e. get tuned, to the next broadcasting channel. If another user starts talking during the attention span of the television just initiated by the first user, then his/her utterances will be neglected.
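
The combination of voice commands into one instruction can be sketched with two lookup tables. The table names and the function name are hypothetical; the only entry taken from the description is the sequence “Bello”, “Channel”, “Next”, which yields the instruction “Increase_Frequency_Band”.

```python
# Recognized speech item (lower case) -> voice command.
VOCABULARY = {"bello": "Bello", "channel": "Channel", "next": "Next"}

# Sequence of voice commands -> instruction for the processing unit.
INSTRUCTIONS = {("Bello", "Channel", "Next"): "Increase_Frequency_Band"}


def create_instruction(speech_items):
    """Map speech items to voice commands and combine them into one instruction.

    Returns None if an item is not in the vocabulary or if the sequence of
    voice commands does not correspond to a known instruction.
    """
    commands = []
    for item in speech_items:
        command = VOCABULARY.get(item.strip().lower())
        if command is None:
            return None  # unrecognized speech item: no instruction is created
        commands.append(command)
    return INSTRUCTIONS.get(tuple(commands))


instruction = create_instruction(["Bello", "channel", "next"])
```

A table-driven mapping fits the limited number of functions that such a unit is expected to control; a larger command set would probably call for a grammar instead.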

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word “comprising” does not exclude the presence of elements or steps not listed in a claim. The word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means can be embodied by one and the same item of hardware.

1. A system for controlling an apparatus on basis of speech, comprising: a microphone array, comprising multiple microphones for receiving respective audio signals; a beam forming module for extracting a speech signal of a user, from the audio signals as received by the microphones, by means of enhancing first components of the audio signals which represent an utterance originating from a first position of the user relative to the microphone array; a speech recognition unit for creating an instruction for the apparatus based on recognized speech items of the speech signal; and a keyword recognition system for recognition of a predetermined keyword that is spoken by the user and which is represented by a particular audio signal; a speech control unit being arranged to control the beam forming module, on basis of the recognition of the predetermined keyword, in order to enhance second components of the audio signals which represent a subsequent utterance originating from a second position of the user relative to the microphone array; wherein the recognition of the predetermined keyword at the second position calibrates the beam forming module to follow the user from the first position to the second position so that the subsequent utterance originating from the second position is accepted while utterances of other users at other positions are discarded, the second position including an orientation and a distance relative to the microphone array, and the speech control unit being configured to discriminate between sounds originating from users who are located in front of each other relative to the microphone array; wherein the subsequent utterance originating from the second position will be discarded if not preceded by the recognition of the predetermined keyword originating from the second position; and wherein the keyword recognition system is arranged to recognize the predetermined keyword that is spoken by another user and the speech control unit being arranged to control the beam forming module, on basis of this recognition, in order to enhance third components of the audio signals which represent another utterance originating from a third orientation of the other user relative to the microphone array.

2. The system as claimed in claim 1, wherein a first one of the microphones of the microphone array is arranged to provide the particular audio signal to the keyword recognition system.

3. The system as claimed in claim 1, wherein the beam forming module is arranged to determine a first position of the user relative to the microphone array.

4. An apparatus comprising: a system for controlling the apparatus on basis of speech as claimed in claim 1; and processing means for execution of the instruction being created by the speech control unit.

5. The apparatus as claimed in claim 4, the apparatus arranged to show that the predetermined keyword has been recognized.

6. The apparatus as claimed in claim 5, further comprising audio generating means for generating an audio signal in order to show that the predetermined keyword has been recognized.

7. A consumer electronics system comprising the apparatus as claimed in claim 4.

8. The system of claim 1, wherein the user is informed by indications that the speech control unit is not active, is in an active state and ready to receive the utterance, or is in a state of calibration.

9. The system of claim 8, wherein the indications include an animal in a sleeping state indicating that the speech control unit is not active, and in an awake state indicating that the speech control unit is in the active state.

10. The system of claim 9, wherein progress of the active state is indicated by an angle of ears of the animal.

11. The system of claim 10, wherein the ears are fully raised at a beginning of the active state, and fully down at an end of the active state.

12. The system of claim 9, wherein the animal has an understanding look when the utterance is recognized and a puzzled look when the utterance is not recognized.

13. The system of claim 1, wherein the beam forming module is connected to the microphone array, and the keyword recognition system is connected to one microphone of the microphone array for detecting the predetermined keyword, the keyword recognition system being further connected to the beam forming module for providing the detected predetermined keyword to the beam forming module.

14. A method of controlling an apparatus on basis of speech, comprising the acts of: receiving respective audio signals by means of a microphone array, comprising multiple microphones; extracting a speech signal of a user, from the audio signals as received by the microphones, by means of enhancing first components of the audio signals which represent an utterance originating from a first position of the user relative to the microphone array; recognizing a predetermined keyword that is spoken by the user based on a particular audio signal and controlling the extraction of the speech signal of the user, on basis of the recognition of the predetermined keyword, in order to enhance second components of the audio signals which represent a subsequent utterance originating from a second position of the user relative to the microphone array while discarding utterances of other users at other positions, the second position including an orientation and a distance relative to the microphone array so that sounds originating from users who are located in front of each other relative to the microphone array are discriminated; creating an instruction for the apparatus based on recognized speech items of the speech signal; discarding the subsequent utterance originating from the second position if not preceded by the recognition of the predetermined keyword originating from the second position; and recognizing the predetermined keyword that is spoken by another user and extracting a speech signal of the user, on basis of this recognition, in order to enhance third components of the audio signals which represent another utterance originating from a third orientation of the other user relative to the microphone array.

15. The method of claim 14, further comprising the act of informing the user by indications that the apparatus is not active, is in an active state and ready to receive the utterance, or is in a state of calibration.

16. The method of claim 15, wherein the indications include an animal in a sleeping state indicating that the speech control unit is not active, and in an awake state indicating that the speech control unit is in the active state.

17. The method of claim 16, wherein progress of the active state is indicated by an angle of ears of the animal.

18. The method of claim 17, wherein the ears are fully raised at a beginning of the active state, and fully down at an end of the active state.

19. The method of claim 16, wherein the animal has an understanding look when the utterance is recognized and a puzzled look when the utterance is not recognized.